{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Table Tutorial\n",
    "\n",
    "[Table](https://hail.is/docs/0.2/hail.Table.html) is Hail's distributed analogue of a data frame or SQL table.  It will be familiar if you've used R or `pandas`, but `Table` differs in 3 important ways:\n",
    "\n",
    "- It is distributed.  Hail tables can store far more data than can fit on a single computer.\n",
    "- It carries global fields.\n",
    "- It is keyed.\n",
    "\n",
    "A `Table` has two different kinds of fields:\n",
    "\n",
    "- global fields\n",
    "- row fields"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Importing and Reading\n",
    "\n",
    "Hail can [import](https://hail.is/docs/0.2/methods/impex.html) data from many sources: TSV and CSV files, JSON files, FAM files, databases, Spark, etc.  It can also *read* (and *write*) a native Hail format.\n",
    "\n",
    "You can read a dataset with [hl.read_table](https://hail.is/docs/0.2/methods/impex.html#hail.methods.read_table).  It take a path and returns a `Table`.  `ht` stands for Hail Table.\n",
    "\n",
    "We've provided a method to download and import [the MovieLens dataset](https://grouplens.org/datasets/movielens/100k/) of movie ratings in the Hail native format. Let's read it!\n",
    "\n",
    "F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=https://dx.doi.org/10.1145/2827872."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import hail as hl\n",
    "\n",
    "hl.init()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hl.utils.get_movie_lens('data/')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "users = hl.read_table('data/users.ht')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Exploring Tables\n",
    "\n",
    "The [describe](https://hail.is/docs/0.2/hail.Table.html#hail.Table.describe) method prints the structure of a table: the fields and their types."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "users.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "You can view the first few rows of the table using [show](https://hail.is/docs/0.2/hail.Table.html#hail.Table.show).\n",
    "\n",
    "10 rows are displayed by default. Try changing the code in the cell below to `users.show(5)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "users.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "You can [count](https://hail.is/docs/0.2/hail.Table.html#hail.Table.count) the rows of a table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "users.count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "You can access fields of tables with the Python attribute notation `table.field`, or with index notation `table['field']`. The latter is useful when the field names are not valid Python identifiers (if a field name includes a space, for example)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "users.occupation.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "users['occupation'].describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "`users.occupation` and `users['occupation']` are [Hail Expressions](https://hail.is/docs/0.2/expressions.html)\n",
    "\n",
    "Lets peak at their using `show`. Notice that the key is shown as well!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "users.occupation.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Exercise\n",
    "\n",
    "The movie dataset has two other tables: `movies.ht` and `ratings.ht`.  Load these tables and have a quick look around."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
